3. Comparison of mouse and human coding genes
Alec MacAndrew
The draft mouse genome was published on 6th December 2002: Waterston et al., Nature 420, 520-562.
Note that this is a 43-page paper (Nature averages 2-3 pages per paper) with around 200 authors and 330 references. This is all new to science, and the volume of material is more than a very fat textbook if one includes the references. The detail is published not in a single paper but in about six related papers occupying more than half of the super-fat 6th December issue of Nature.
Genes and Pseudogenes
The first part of this review focuses on the protein-encoding genes. The current human gene catalogue contains just under 23,000 predicted genes in just under 200,000 exons. (Remember that genes are not continuous blocks of code in the genome but are interrupted by long stretches of non-coding sequence. The coding sections are called exons, averaging about nine per gene, and the interrupting sections are called introns.) It is known that this is an incomplete set of genes (see below).
Unprocessed pseudogenes arise either from the duplication of a gene during DNA replication or from the degeneration of genes that have become inactive and are no longer under selection.
How do we recognise pseudogenes? Processed pseudogenes are all exons and no introns. Both types accumulate mutations neutrally, including multiple frame-shifts and premature stop codons. There is another measure (which will also be important when we look at proteins in a future article): the ratio of non-synonymous to synonymous mutation rates. Let's look at it now.
Each amino acid in a protein is coded by a sequence of three bases called a triplet codon. There are 20 amino acids that make up proteins but 64 different codons (4 possible bases at each of the three positions gives 4³ = 64 combinations). That means that most amino acids can be coded by several different codons.
| First base | Second base U | Second base C | Second base A | Second base G |
|---|---|---|---|---|
| U | UUU Phe | UCU Ser | UAU Tyr | UGU Cys |
| U | UUC Phe | UCC Ser | UAC Tyr | UGC Cys |
| U | UUA Leu | UCA Ser | UAA STOP | UGA STOP |
| U | UUG Leu | UCG Ser | UAG STOP | UGG Trp |
| C | CUU Leu | CCU Pro | CAU His | CGU Arg |
| C | CUC Leu | CCC Pro | CAC His | CGC Arg |
| C | CUA Leu | CCA Pro | CAA Gln | CGA Arg |
| C | CUG Leu | CCG Pro | CAG Gln | CGG Arg |
| A | AUU Ile | ACU Thr | AAU Asn | AGU Ser |
| A | AUC Ile | ACC Thr | AAC Asn | AGC Ser |
| A | AUA Ile | ACA Thr | AAA Lys | AGA Arg |
| A | AUG Met | ACG Thr | AAG Lys | AGG Arg |
| G | GUU Val | GCU Ala | GAU Asp | GGU Gly |
| G | GUC Val | GCC Ala | GAC Asp | GGC Gly |
| G | GUA Val | GCA Ala | GAA Glu | GGA Gly |
| G | GUG Val | GCG Ala | GAG Glu | GGG Gly |
The table of mRNA triplet codons and the amino acids they code for. Note that T (thymine) in DNA is replaced by U (uracil) in RNA. mRNA is transcribed from the anti-sense (template) strand of DNA.
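The redundancy in the table can be checked with a short Python sketch. It rebuilds the standard genetic code from a 64-letter string of one-letter amino acid codes (with "*" standing for STOP) and counts how many codons map to each amino acid. The names `BASES`, `AMINO_ACIDS`, `CODON_TABLE` and `degeneracy` are chosen here purely for illustration.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code in one-letter amino acid codes ("*" = STOP), listed
# in the order first base, second base, third base, each running U, C, A, G.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map every triplet codon, e.g. "UUU", to its amino acid, e.g. "F" (Phe).
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of different codons that code for each amino acid (or STOP).
degeneracy = Counter(CODON_TABLE.values())

sense_codons = len(CODON_TABLE) - degeneracy["*"]
amino_acid_count = len(degeneracy) - 1  # do not count STOP as an amino acid

print(len(CODON_TABLE), "codons in total")
print(sense_codons, "sense codons for", amino_acid_count, "amino acids")
print("Leu is coded by", degeneracy["L"], "codons; Phe by", degeneracy["F"])
```

Only Met and Trp have a single codon each; every other amino acid has at least two.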
Now, for a gene under selection, synonymous mutations (ie mutations which substitute one codon for another coding for the same amino acid) produce the same protein and are, to a first approximation, not acted on by natural selection. For example, Phe (phenylalanine) is coded by both UUU and UUC, and some amino acids, such as Leu, are coded by six different codons. Non-synonymous mutations produce a different amino acid and hence a modified protein, and are acted on by natural selection. So if we look at fixed synonymous versus non-synonymous mutations in an active gene, we will see different rates for the two types of mutation. A pseudogene does not produce a functional protein, so there should be no difference between the synonymous and non-synonymous rates. The ratio between the non-synonymous and synonymous mutation rates is known as the Ka/Ks ratio: it is well below 1 for a typical active gene and drifts towards 1 in a pseudogene. However, very recent pseudogenes are quite difficult to spot by accumulated mutations, as they were acted on by natural selection for all the time they were active genes. Nevertheless, the Ka/Ks ratio of a gene is strong evidence for whether it is active or a pseudogene.
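The distinction between the two kinds of mutation can be illustrated with a deliberately crude Python sketch. It compares two aligned coding sequences codon by codon and counts how many differing codons are synonymous and how many are not. This is not a real Ka/Ks calculation, which normalises by the number of synonymous and non-synonymous sites and corrects for multiple mutations at the same position. The function name `count_substitutions` and the two example sequences are invented for illustration.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one-letter amino acid codes, "*" = STOP.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_substitutions(seq_a, seq_b):
    """Count differing codons between two aligned mRNA coding sequences.

    Returns (synonymous, non_synonymous). Each differing codon is counted
    once, however many of its three bases have changed.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    synonymous = 0
    non_synonymous = 0
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1      # same amino acid, protein unchanged
        else:
            non_synonymous += 1  # different amino acid, protein modified
    return synonymous, non_synonymous


# Made-up example: Phe-Leu-Gly versus Phe-Leu-Asp.
result = count_substitutions("UUUCUAGGU", "UUCCUGGAU")
print(result)  # (2, 1): two silent changes, one amino acid change
```

In an active gene we would expect the second number to be held down by selection relative to the first; in a pseudogene both kinds of change accumulate freely.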
One extreme case is the Gapdh gene. Mouse has one functional Gapdh gene but 400 pseudogenes scattered across many of its chromosomes. (Note that this is an exceptional number. Don't run away with the idea that all genes have that many pseudogenes; in fact the average is likely to be around 1 pseudogene per gene. About 18,000 pseudogenes were found altogether.) Of the 400, nearly 300 are easily identified as pseudogenes by the methods above, but 100 are recent enough that they had to be identified by careful manual inspection. But the fact that we now have both mouse and human genomes gives us another line of attack: the pseudogenes in the mouse genome do not have a corresponding homologous gene in the same syntenic position in humans, whereas the active gene does.
By looking closely and suspiciously at predicted mouse genes that fail to have a human homologue in a syntenic location, 4,000 were found that were actually pseudogenes rather than real genes. The average number of exons in these pseudogenes was less than half that in actual genes, just as predicted, since many exons have been deleted once the gene has become inactive. Of the total of 18,000 pseudogenes found (14,000 clearly such, plus the 4,000 previously classified as genes), more than half are processed pseudogenes (they have no introns). There are probably a good many more pseudogenes that haven't been identified because they are ancient and have decayed so far, through millions of years of neutral mutation, that they are unrecognisable (see the article on repeat sequences).
Comparison of Mouse and Human gene sets
Now, having identified which sequences are pseudogenes and removed them from the gene catalogue, it is possible to compare the mouse and human gene sets. At the time of publication of the draft mouse genome, headline writers in popular publications came up with sensational and unjustified claims such as "Mouse 99% same as Human" and other misleading statements. This is what was actually determined: 99% of mouse genes have homologues in man (the actual protein similarity is much less than 99%; see the article on mouse proteins). Of these, 96% are in the same syntenic location in man as in mouse. 80% of mouse genes that have a match in the same syntenic region in man are also the best match for that human gene. These are called 1:1 orthologues, ie not just similar genes but genes that have descended and diverged from a common ancestor.
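The "best match in both directions" idea behind 1:1 orthologues can be sketched in Python as a reciprocal best-hit search. This is only an illustration of the principle, not the procedure used in the paper: the gene names and similarity scores are made up, and the sketch ignores the syntenic-location requirement described above.

```python
def reciprocal_best_hits(scores):
    """Return (mouse_gene, human_gene) pairs that are each other's best match.

    scores maps (mouse_gene, human_gene) to a similarity score, where a
    higher score means a closer match.
    """
    best_for_mouse = {}  # mouse gene -> its best-scoring human gene
    best_for_human = {}  # human gene -> its best-scoring mouse gene
    for (mouse, human), score in scores.items():
        current = best_for_mouse.get(mouse)
        if current is None or score > scores[(mouse, current)]:
            best_for_mouse[mouse] = human
        current = best_for_human.get(human)
        if current is None or score > scores[(current, human)]:
            best_for_human[human] = mouse
    return sorted(
        (mouse, human)
        for mouse, human in best_for_mouse.items()
        if best_for_human[human] == mouse
    )


# Hypothetical similarity scores between three mouse and two human genes.
example_scores = {
    ("mouseA", "humanA"): 0.95,
    ("mouseA", "humanB"): 0.60,
    ("mouseB", "humanB"): 0.90,
    ("mouseC", "humanB"): 0.85,
}

pairs = reciprocal_best_hits(example_scores)
print(pairs)  # [('mouseA', 'humanA'), ('mouseB', 'humanB')]
```

Here mouseC has a homologue (humanB) but is not its best match, so it is a homologue without being a 1:1 orthologue.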
The fewer than 1% of mouse genes (118) with no homologues in humans do have homologues in other species. So we can explain them as follows: the corresponding gene has been deleted from the human genome; or it is rodent-specific (unlikely, since they are all known in other organisms); or the corresponding gene has not yet been found in humans; or the genes are evolving so rapidly in one or other lineage that they are unrecognisable as homologues.
A completely different method for predicting genes (not based on looking for sequences which code for known proteins) was also used. This identifies genes by looking for the statistical properties of coding regions, TATA boxes, UTRs, splice sites, introns etc. The process is enhanced by applying it to two genomes simultaneously, and it was applied to the human and mouse sequences. This technique found a possible further 12,000 exons beyond the existing catalogue. Sampling a subset and checking the predictions experimentally suggests that about 6,000 of these are actual active exons, yielding about 1,000 additional genes.
Number of genes in the mammalian genome
How many genes does a mammal have? The current count of predicted genes in human and mouse is about 23,000, with 190,000 exons. But there is also a database of mammalian complementary DNA (cDNA is made from the mRNA present in different mouse tissues using reverse transcriptase, and so corresponds to the exons in genes). 79% of known mouse cDNAs are found in the mouse exons predicted from the sequence, so we are missing about 21%. Taking that together with the fact that not all cDNAs have been identified and that some predictions are false positives gives an exon count of about 225,000 – 250,000. From other data, we know that there are on average 8.3 exons per mouse gene, and that would give 27,000 – 30,000 genes in the mammalian genome. Although the number has fluctuated wildly in the last 3 years, we seem to be homing in on a number around 30,000.
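The arithmetic behind this estimate is easy to reproduce. The Python sketch below uses only the figures quoted above; the variable names are invented for illustration. Note that the first step is just the naive scale-up from 79% coverage. The further adjustments for undiscovered cDNAs and false positives are not quantified in this article, so the sketch simply takes the quoted 225,000 – 250,000 range as given.

```python
predicted_exons = 190_000   # exons in the current predicted gene catalogue
cdna_coverage = 0.79        # fraction of known cDNAs found in those exons
exons_per_gene = 8.3        # average number of exons per mouse gene

# Naive scale-up: if the predicted exons capture only 79% of what the cDNA
# data show, the full exon set is roughly predicted_exons / 0.79.
scaled_exon_count = predicted_exons / cdna_coverage
print(f"naive scaled exon count: {scaled_exon_count:,.0f}")  # 240,506

# Exon range quoted in the text after the additional adjustments.
exon_range = (225_000, 250_000)

# Dividing by the average exons per gene gives the gene-count range.
gene_range = tuple(round(exons / exons_per_gene) for exons in exon_range)
print(f"estimated genes: {gene_range[0]:,} to {gene_range[1]:,}")
# estimated genes: 27,108 to 30,120
```

The naive figure of about 240,000 exons falls inside the quoted range, which is then rounded to 27,000 – 30,000 genes.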
However, if there are small single-exon genes that are not strongly expressed, they would not be detected and would not be included in the 30,000.
RNA Genes
Finally the researchers looked at RNA genes. These genes code not for proteins but for RNA, including the tRNA used for transferring amino acids to the polypeptide chain in the ribosome. The human catalogue had 518 tRNA genes and 118 pseudogenes. It is much more difficult to identify tRNA genes in mouse because mouse has an active SINE (repeat sequence - see earlier post) that is derived from tRNA and leaves debris scattered about the genome that looks like tRNA genes. At first pass the researchers found 2,764 candidate tRNA genes and 22,000 pseudogenes, but the vast majority were masked out as SINEs. That left 498 possibles. But we expect active tRNA genes to be extremely highly conserved across species. If we include only genes with at least 95% sequence identity we find 335 in mouse and 345 in man, of which about 250 are absolutely identical. That set includes all 46 expected anti-codons. There are 46 anti-codons to translate the 61 sense codons because of the famous Crick wobble rules, which state that the base in the first (5') position of an anti-codon, the one that pairs with the third base of the codon, can pair with more than one base, and so a single anti-codon can translate more than one codon. This occurs without loss of information since, you will remember, the set of codons has redundancy. So 61 codons are translated into the 20 amino acids via 46 anti-codons.
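The wobble rules can be made concrete with a small Python sketch that lists the codons a given anti-codon can read. It uses Crick's original pairing rules for the wobble position, including inosine (written "I" here), a modified base found in some tRNA anti-codons. The function name `codons_read` is invented for illustration, and real tRNAs carry further base modifications that this sketch ignores.

```python
# Ordinary Watson-Crick pairing, used at the other two positions.
PAIR = {"A": "U", "U": "A", "G": "C", "C": "G"}

# Crick's wobble rules: which codon third bases the first (5') base of the
# anti-codon can pair with. "I" is inosine.
WOBBLE = {"G": "CU", "U": "AG", "C": "G", "A": "U", "I": "UCA"}


def codons_read(anticodon):
    """List the mRNA codons (5'->3') that an anti-codon (5'->3') can read.

    The two strands pair antiparallel, so the third base of the anti-codon
    pairs with the first base of the codon, and the first base of the
    anti-codon (the wobble position) pairs with the third base of the codon.
    """
    first, second, third = anticodon
    return sorted(PAIR[third] + PAIR[second] + wobble for wobble in WOBBLE[first])


print(codons_read("GAA"))  # ['UUC', 'UUU'], both Phe
print(codons_read("IGC"))  # ['GCA', 'GCC', 'GCU'], all Ala
print(codons_read("CAU"))  # ['AUG'], Met
```

In each case all the codons read by one anti-codon specify the same amino acid, which is why fewer than 61 anti-codons are enough.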
Conclusion
Would any of this be possible without common descent? The increased richness that comes from comparing mammalian genomes rather than looking at just one is astonishing. We learn a great deal by comparing the genomes, and the value of doing so relies entirely on the relationship and common descent of human and mouse.
1. For an excellent, very detailed review of pseudogenes and their implications for the evolution/creation debate, see Edward Max, Plagiarized Errors and Molecular Genetics.